Model-augmented Prioritized Experience Replay (MaPER)

1 Overview

Model-augmented Prioritized Experience Replay (MaPER), proposed by Y. Oh et al. [1], extends the critic network so that it predicts Q-values more accurately. This critic, the Model-augmented Critic Network (MaCN), predicts not only the Q-value but also the reward and the next state, using shared weights.

In MaPER, the priority of transition \(i\) is a weighted sum of the TD error and the model prediction errors for the reward and the next state: \(\sigma_i = \xi_Q |\delta Q_{\theta}|_{MSE} + \xi_R |\delta R_{\theta}|_{MSE} + \xi_S |\delta S_{\theta}|_{MSE}\).
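As a rough illustration of the priority formula, the sketch below computes \(\sigma_i\) for a single transition in plain Python. The function names and argument layout are hypothetical, and it assumes that \(|\cdot|_{MSE}\) means the squared error of each prediction, averaged over the state dimensions for the next-state term.

```python
def mean_squared_error(pred, target):
    """Mean of squared differences between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def transition_priority(q_pred, q_target, r_pred, r_target,
                        s_pred, s_target, xi_q, xi_r, xi_s):
    """Weighted sum of the Q-value, reward, and next-state errors.

    Assumption: each error term is a squared error (averaged over the
    state dimensions for the next-state prediction).
    """
    q_err = (q_pred - q_target) ** 2
    r_err = (r_pred - r_target) ** 2
    s_err = mean_squared_error(s_pred, s_target)
    return xi_q * q_err + xi_r * r_err + xi_s * s_err
```

With a perfect reward prediction, only the Q-value and next-state terms contribute to the priority.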

The coefficients \(\xi_j\) are adapted during training according to the following rule:

\[ \xi_j = \frac{1}{Z}\exp \left ( \frac{\mathcal{L}_j^{t-1}}{\mathcal{L}_j^{t-2}T}\right )~\text{where}~j=Q,R,S \]
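The update rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it treats \(T\) as a temperature, takes \(Z\) to be the constant that makes the three coefficients sum to 1, and uses a hypothetical function name and dictionary layout for the losses of the two previous steps.

```python
import math


def update_coefficients(prev_losses, prev_prev_losses, temperature=1.0):
    """Compute xi_Q, xi_R, xi_S from the losses of the two previous steps.

    prev_losses / prev_prev_losses: dicts with keys "Q", "R", "S" holding
    the losses at steps t-1 and t-2 (the t-2 losses must be non-zero).
    Assumption: Z normalizes the coefficients so that they sum to 1.
    """
    unnormalized = {
        j: math.exp(prev_losses[j] / (prev_prev_losses[j] * temperature))
        for j in ("Q", "R", "S")
    }
    z = sum(unnormalized.values())
    return {j: value / z for j, value in unnormalized.items()}
```

A component whose loss shrank less between the two steps (a larger ratio) receives a larger coefficient, and a larger temperature flattens the differences between the coefficients.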

MaPER works as a kind of curriculum learning. It first learns from transitions with high model prediction errors; then, once the model dynamics have been learned, it shifts to transitions with large TD errors. According to the authors' experiments, the TD error decreases faster and the estimated Q-values match the returns (i.e., episode rewards) well.

2 With cpprb

You can implement MaPER with PrioritizedReplayBuffer.
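One possible arrangement is sketched below: sample a batch, compute the three per-transition errors, and write the combined values back as priorities. The helper `refresh_priorities` and the callback `errors_fn` are hypothetical names introduced for illustration; only `PrioritizedReplayBuffer`, `add`, `sample`, and `update_priorities` come from cpprb. The random errors in `demo` are a stand-in for a real MaCN.

```python
def refresh_priorities(rb, batch_size, beta, errors_fn, xi_q, xi_r, xi_s):
    """Sample a batch and overwrite its priorities with MaPER-style values.

    rb: a cpprb PrioritizedReplayBuffer (any object with the same
        sample / update_priorities methods works).
    errors_fn: hypothetical callback mapping the sampled batch to three
        sequences of per-transition errors (Q-value, reward, next state).
    """
    sample = rb.sample(batch_size, beta=beta)
    q_err, r_err, s_err = errors_fn(sample)
    # A plain loop keeps this helper free of NumPy; with arrays the
    # vectorized expression xi_q * q_err + ... is equivalent.
    priorities = [xi_q * q + xi_r * r + xi_s * s
                  for q, r, s in zip(q_err, r_err, s_err)]
    rb.update_priorities(sample["indexes"], priorities)
    return sample, priorities


def demo():
    """Run the helper against a real buffer (needs cpprb and NumPy)."""
    # Imported here so that the helper above stays dependency-free.
    import numpy as np
    from cpprb import PrioritizedReplayBuffer

    rb = PrioritizedReplayBuffer(
        1024,
        {"obs": {"shape": 3}, "act": {"shape": 1}, "rew": {},
         "next_obs": {"shape": 3}, "done": {}},
        alpha=0.6,
    )
    rng = np.random.default_rng(0)
    for _ in range(64):
        rb.add(obs=rng.normal(size=3), act=rng.normal(size=1),
               rew=rng.normal(), next_obs=rng.normal(size=3), done=0.0)

    def random_errors(sample):
        # Placeholder for the MaCN errors: random values, one per transition.
        n = len(sample["indexes"])
        return rng.random(n), rng.random(n), rng.random(n)

    return refresh_priorities(rb, 32, 0.4, random_errors,
                              xi_q=1 / 3, xi_r=1 / 3, xi_s=1 / 3)
```

In a training loop, `errors_fn` would run the critic on the sampled batch, and the coefficients would be recomputed from the recent losses before each call. The sampled batch also carries the importance-sampling `weights`, which can be applied to the loss.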

  1. Y. Oh et al., “Model-augmented Prioritized Experience Replay”, ICLR (2022)